
Here, I want to discuss the main probability distributions (based on my humble knowledge). Probability is an area that fascinates me because it has applications across many fields of science. The principal probability distributions needed to understand inference and statistical modeling are the Bernoulli, Binomial, Negative-Binomial, Poisson, Normal, and Gamma. Of course, there are many other essential distributions that I will not cover here. For each one, I will try to explain the support and the parameters, as well as the idea behind it.

Bernoulli distribution

The first one is the most famous: the Bernoulli distribution. Let X be a binary random variable with probability mass function (PMF) f_{x}. Then X \sim Ber(p) has PMF

f(x) = \mathrm{P}(X = x) = p^{x} (1-p)^{1-x}

where the support is X \in \{0, 1\} and the parametric space is p \in (0, 1). The expected value (mathematical expectation) is \mathrm{E}(X) = p and the variance is \mathrm{Var}(X) = p(1 - p).
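These two moments are easy to verify by simulation. Below is a minimal sketch, assuming an arbitrary p = 0.3 and 100,000 draws (a Bernoulli variable is just a Binomial with size = 1):

```r
## Simulate Bernoulli draws and compare empirical moments with theory
set.seed(123)
p <- 0.3                                   # arbitrary success probability
x <- rbinom(n = 1e5, size = 1, prob = p)   # Bernoulli = Binomial with size = 1

mean(x)   # close to E(X) = p = 0.3
var(x)    # close to Var(X) = p * (1 - p) = 0.21
```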

I will not discuss moments in statistics here, beyond noting that the first moment is the mathematical expectation and the second is related to the variance. Nevertheless, Wikipedia is a good place to start studying this topic. I love this concept because everything concerning statistical models is linked to a mean, mainly in generalized linear models (GLMs). But that is a topic for later.

Below is code that draws fifteen success probabilities p and plots, for each one, \mathrm{P}(X = 1) = p and \mathrm{P}(X = 0) = 1 - p. In the chart, the x-axis groups the outcomes X = 1 and X = 0, and the y-axis shows \mathrm{P}(X = 1) and \mathrm{P}(X = 0), respectively. Another property to keep in mind is \mathrm{P}(X = 1) + \mathrm{P}(X = 0) = 1.

set.seed(123)
## grid of possible success probabilities
value <- seq(1e-06, 0.999999, by = 0.001)
## fifteen random values of p and their complements
p <- sample(value, size = 15, replace = TRUE)
q <- 1 - p

data0 <- cbind(`X = 1` = p, `X = 0` = q)

barplot(data0, beside = TRUE, main = "Bernoulli distribution",
    xlab = "Realization of the variable",
    ylab = "Probability", col = rainbow(15))

The Bernoulli distribution gets a huge spotlight in many areas, mainly because of its applications. Whenever your response variable takes one of two values (good/bad, success/failure), logistic regression is the appropriate model to study.
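As a small illustration of that link, the sketch below fits a logistic regression with `glm` on simulated binary data; the covariate and the true coefficients (-0.5 and 1.2) are made up for the example:

```r
## Logistic regression on simulated Bernoulli responses
set.seed(123)
n <- 200
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)            # true success probability
y <- rbinom(n, size = 1, prob = p)     # Bernoulli response

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                              # estimates near (-0.5, 1.2)
```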

Binomial distribution

The Binomial distribution is essential primarily due to its applications in experimental areas such as health and agronomy. Let X be the number of successes in n independent Bernoulli trials, with probability mass function (PMF) f_{x}. Then X \sim Bin(n, p) has PMF

f(x) = \mathrm{P}(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}

where the support is X \in \{0, 1, \dots, n\} (number of successes), the parametric space is p \in (0, 1) (success probability for each trial), and n \in \{1, 2, \dots\} is the number of trials. The expected value is \mathrm{E}(X) = np and the variance is \mathrm{Var}(X) = np(1 - p).
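A quick simulation check of these two formulas, using arbitrary values n = 30 and p = 0.4:

```r
## Check E(X) = n*p and Var(X) = n*p*(1 - p) by simulation
set.seed(123)
n_trials <- 30
p <- 0.4
x <- rbinom(1e5, size = n_trials, prob = p)

mean(x)   # close to 30 * 0.4 = 12
var(x)    # close to 30 * 0.4 * 0.6 = 7.2
```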

Regarding this last quantity, n, I would prefer to treat it as a fixed value rather than a parameter. That is because you always know this value beforehand, while a parameter is something to be estimated from the sample, not known in advance. It is close to the idea of a hyperparameter in machine learning techniques.

The graph below shows how different values of p affect the probability mass function of the Binomial distribution.

par(mfrow = c(2, 2))

n <- 30
success <- seq(0, n)
prob <- c(0.2, 0.4, 0.6, 0.8)

for (i in seq_along(prob)) {

    ## dbinom() is deterministic, so no seed is needed here
    dens <- dbinom(success, size = n,
        prob = prob[i])
    name <- paste0("Binomial Distribution (n=",
        n, ", p=", prob[i], ")")

    plot(success, dens, type = "h",
        main = name, ylab = "Probability",
        xlab = "Successes", lwd = 3)
}

One model that comes from this distribution is the dose-effect model used in experiments, and it also arises from the GLM framework. I will explain this model in a future post and link it here when it is ready.
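To give a flavor of the dose-effect idea, here is a minimal sketch with hypothetical doses and mortality counts (all numbers invented for illustration); the response is the pair (successes, failures) and the link is logistic:

```r
## Dose-effect sketch: Binomial GLM on (successes, failures) per dose
dose <- c(1, 2, 4, 8, 16)             # hypothetical doses
m    <- rep(50, 5)                    # subjects exposed at each dose
dead <- c(5, 12, 24, 38, 47)          # hypothetical responses

fit <- glm(cbind(dead, m - dead) ~ log(dose),
           family = binomial(link = "logit"))
coef(fit)                             # positive slope: response rises with dose

## Estimated LD50 (dose giving a 50% response)
exp(-coef(fit)[[1]] / coef(fit)[[2]])
```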

Poisson distribution

When the response variable you have to analyze is a non-negative count, the Poisson distribution is the best distribution to begin your study. Let X be a discrete random variable with probability mass function (PMF) f_{x}. Then X \sim Pois(\lambda) has PMF

f(x) = \mathrm{P}(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}

where the support is X \in \{0, 1, 2, \dots\} (number of occurrences) and the parametric space is \lambda \in (0, +\infty) (rate). The expected value and variance are equal: \mathrm{E}(X) = \mathrm{Var}(X) = \lambda. Below is code to generate data from a Poisson distribution in R.

## Package
library(vcd)

## Poisson
set.seed(123)
n <- 500
lambda = 4
x <- rpois(n = n, lambda = lambda)

var(x)/mean(x)
## [1] 0.9989307
## Histogram to discrete data
result.prop <- prop.table(table(x))
round(result.prop, 6) * 100
## x
##    0    1    2    3    4    5    6    7    8    9   10   12 
##  1.4  5.2 17.2 21.8 18.0 15.6  9.2  5.0  4.0  1.8  0.6  0.2
max.ta <- max(table(x)) + 5

bar <- barplot(table(x), xaxt = "n",
    ylim = c(0, max.ta), xlab = "x",
    ylab = "Frequency")
axis(1, at = bar, labels = data.frame(table(x))$x,
    las = 3)

## You might also use type = "standing"
gf <- goodfit(x, "poisson", method = "ML")
plot(gf, type = "hanging", scale = "sqrt",
    xlab = "Number of Occurrences",
    ylab = expression(sqrt("Frequency")))

As you see, the mean and variance share the same parameter, and this property is crucial in many real-world applications. When \mathrm{Var}(X) < \mathrm{E}(X) we have underdispersion, which is not common in Poisson models; however, there are techniques to deal with it. Conversely, when \mathrm{Var}(X) > \mathrm{E}(X) we have overdispersion, which is more common in Poisson models. Usually, the problem arises because of an excess of zeros. Many approaches exist to handle it, such as Negative-Binomial regression (the most usual), Zero-Inflated Poisson regression, and others.
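As a hedged sketch of the Negative-Binomial alternative, the code below simulates overdispersed counts and compares a Poisson fit with `glm.nb` from the MASS package; the covariate and the coefficients are invented for the example:

```r
## Poisson vs Negative-Binomial regression on overdispersed counts
library(MASS)                          # for glm.nb()

set.seed(123)
n <- 500
x <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y <- rnbinom(n, size = 1.5, mu = mu)   # Var = mu + mu^2/1.5 > mu

pois_fit <- glm(y ~ x, family = poisson)
nb_fit   <- glm.nb(y ~ x)

## Deviance far above the residual df flags overdispersion in the Poisson fit
c(poisson = deviance(pois_fit) / df.residual(pois_fit),
  negbin  = deviance(nb_fit) / df.residual(nb_fit))
```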

Now, I will assume that \lambda \sim Gamma(\alpha, \beta). For the moment, note only that the support of the Gamma distribution, (0, +\infty), coincides with the parametric space of the Poisson. When I explain the Negative-Binomial distribution, this will become clearer.

set.seed(123)
n <- 500
k <- 10
theta <- 0.409
lambda = rgamma(n, shape = k, scale = theta)

## the mean is close to lambda = 4
mean(lambda)
## [1] 3.986617
x <- rpois(n = n, lambda = lambda)

## overdispersion
var(x)/mean(x)
## [1] 1.343691
## Histogram for discrete data
result.prop <- prop.table(table(x))
round(result.prop, 6) * 100
## x
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  3.2 10.6 13.2 18.8 16.2 12.4 11.8  7.2  3.0  1.8  0.8  0.2  0.2  0.4  0.2
max.ta <- max(table(x)) + 5

bar <- barplot(table(x), xaxt = "n",
    ylim = c(0, max.ta), xlab = "x",
    ylab = "Frequency")
axis(1, at = bar, labels = data.frame(table(x))$x,
    las = 3)

## Poisson; you might also use type = "standing"
gf <- goodfit(x, "poisson", method = "ML")
plot(gf, type = "hanging", scale = "sqrt",
    xlab = "Number of Occurrences",
    ylab = expression(sqrt("Frequency")))

## Negative-Binomial
gf <- goodfit(x, "nbinomial", method = "ML")
plot(gf, type = "hanging", scale = "sqrt",
    xlab = "Number of Occurrences",
    ylab = expression(sqrt("Frequency")))

What do you think about these results? They are similar to the results generated from a Poisson distribution with \lambda = 4, so we may conclude that, here, the Poisson distribution remains a good option compared to the Negative-Binomial distribution.
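The equivalence at work here can also be checked numerically: a Poisson whose rate follows a Gamma(shape = k, scale = θ) is exactly a Negative-Binomial with size = k and mean kθ. A quick sketch with the same k and θ as above:

```r
## Gamma-Poisson mixture versus the exact Negative-Binomial PMF
set.seed(123)
k <- 10
theta <- 0.409
x <- rpois(1e5, lambda = rgamma(1e5, shape = k, scale = theta))

emp <- sapply(0:10, function(v) mean(x == v))  # empirical mixture frequencies
thy <- dnbinom(0:10, size = k, mu = k * theta) # Negative-Binomial probabilities
round(cbind(x = 0:10, empirical = emp, nbinom = thy), 4)
```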

In the code below, \lambda is generated so as to create overdispersion.

set.seed(123)
n <- 500
k <- 1.07
theta <- 4
lambda = rgamma(n, shape = k, scale = theta)

## the mean is close to lambda = 4
mean(lambda)
## [1] 3.990251
x <- rpois(n = n, lambda = lambda)

## overdispersion
var(x)/mean(x)
## [1] 4.605163
## Histogram for discrete data
result.prop <- prop.table(table(x))
round(result.prop, 6) * 100
## x
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 17.4 15.4 13.2 11.4 11.4  4.8  6.0  3.2  3.4  2.2  2.4  2.4  1.0  1.0  1.0  1.2 
##   16   17   18   19   20   22   25 
##  0.4  0.4  0.4  0.4  0.4  0.2  0.4
max.ta <- max(table(x)) + 5

bar <- barplot(table(x), xaxt = "n",
    ylim = c(0, max.ta), xlab = "x",
    ylab = "Frequency")
axis(1, at = bar, labels = data.frame(table(x))$x,
    las = 3)

## Poisson; you might also use type = "standing"
gf <- goodfit(x, "poisson", method = "ML")
plot(gf, type = "hanging", scale = "sqrt",
    xlab = "Number of Occurrences",
    ylab = expression(sqrt("Frequency")))

## Negative-Binomial
gf <- goodfit(x, "nbinomial", method = "ML")
plot(gf, type = "hanging", scale = "sqrt",
    xlab = "Number of Occurrences",
    ylab = expression(sqrt("Frequency")))

What must be noticed regarding the overdispersion is that the zeros are responsible for this problem: they account for 17.4% of our data set and are the most frequent value.
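A simple diagnostic for this excess of zeros is to compare the observed proportion of zeros with the proportion implied by a Poisson with the same mean; the sketch below reuses the simulation settings from the code above:

```r
## Excess-zero diagnostic: observed vs Poisson-implied proportion of zeros
set.seed(123)
lambda <- rgamma(500, shape = 1.07, scale = 4)
x <- rpois(500, lambda = lambda)

obs_zero  <- mean(x == 0)                # observed proportion of zeros
pois_zero <- dpois(0, lambda = mean(x))  # P(X = 0) for a Poisson with same mean

c(observed = obs_zero, poisson = pois_zero)
```

The observed proportion being several times larger than the Poisson-implied one is exactly the zero-excess signature described above.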

Another important property is that the Poisson counts the number of events in a specific interval of time, distance, area, or volume, which has a significant impact on studies. I will give two examples instead of a mathematical definition. In the first one, you want to know whether a particular highway improved in accident numbers, in other words, whether there has been a decrease in the number of accidents. Accident counts are available at eight points on the highway before and after the improvements, over the number of years specified in each case. Since the number of years differs across sample observations, you need to deal with an offset. In the second example, you want to study the number of dengue cases in each city of a specific state, but you cannot compare the cities directly because their populations are not the same size. So, in this case, you have an offset again.
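Both situations are handled the same way in a Poisson GLM: the exposure enters as offset(log(exposure)), with its coefficient fixed at 1, so the model describes rates rather than raw counts. A minimal sketch with invented city populations and an invented covariate:

```r
## Poisson regression with an offset for unequal exposure (population size)
set.seed(123)
cities <- 50
pop    <- round(runif(cities, 1e4, 1e6))   # hypothetical populations
x      <- rnorm(cities)                    # some covariate
rate   <- exp(-8 + 0.4 * x)                # true cases per inhabitant
cases  <- rpois(cities, lambda = rate * pop)

## log(pop) enters with its coefficient fixed at 1 via offset()
fit <- glm(cases ~ x + offset(log(pop)), family = poisson)
coef(fit)                                  # estimates near (-8, 0.4)
```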
